Support writing GeospatialStatistics in Parquet writer #8524
Conversation
Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.
parquet/src/column/writer/encoder.rs
Outdated
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
    *self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
I think this should execute regardless of whether geo stats are enabled. The `variable_length_bytes` are ultimately written to the `SizeStatistics`, which are useful even without min/max statistics.
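For illustration, a minimal self-contained sketch of the suggested control flow; this is a toy model of the split, not the actual writer internals:

```rust
/// Toy model of the suggested split: variable-length byte counts feed the
/// size statistics and are accumulated unconditionally, while geospatial
/// bounds are computed only when geo statistics are enabled.
#[derive(Default)]
struct PageStats {
    variable_length_bytes: Option<i64>,
    geo_enabled: bool,
}

impl PageStats {
    fn update(&mut self, values: &[Vec<u8>]) {
        // Always accumulate: size statistics are useful even without min/max.
        let var_bytes: i64 = values.iter().map(|v| v.len() as i64).sum();
        *self.variable_length_bytes.get_or_insert(0) += var_bytes;

        if self.geo_enabled {
            // ... update geospatial bounds from the WKB values here ...
        }
    }
}

fn main() {
    let mut stats = PageStats::default();
    stats.update(&[vec![0u8; 21], vec![0u8; 21]]);
    assert_eq!(stats.variable_length_bytes, Some(42));
}
```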
Done!
parquet/tests/geospatial.rs
Outdated
drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Heads up that when the thrift stuff merges this will no longer be a `format::FileMetaData` but a `file::metadata::ParquetMetaData`.
Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).
Thank you for the review! I will clean this up on Monday and add a few more tests.
@paleolimbot I took a stab at resolving the merge conflicts. They are mostly trivial, but I wasn't sure how to resolve the tests. I'll leave that up to you 😄.
Thank you @paleolimbot and @etseidl
I reviewed this PR for test coverage and structure, and from my perspective it is good to go. I had a few minor comments / suggestions, but nothing I think would prevent merging.
}
}

/// Explicitly specify the Parquet schema to be used |
this is a nice API addition I think
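As a rough sketch of what using such an option could look like, hand-building a schema that carries a Geometry logical type. The `crs` field name on `LogicalType::Geometry` is an assumption here, and this is not necessarily the exact API surface added by this PR:

```rust
use parquet::basic::{LogicalType, Type as PhysicalType};
use parquet::schema::types::Type;

// Build a Parquet schema containing a BYTE_ARRAY column annotated with a
// Geometry logical type, which could then be passed to the writer via the
// new explicit-schema option.
fn geometry_schema() -> Type {
    let geom = Type::primitive_type_builder("geom", PhysicalType::BYTE_ARRAY)
        .with_logical_type(Some(LogicalType::Geometry { crs: None }))
        .build()
        .unwrap();
    Type::group_type_builder("schema")
        .with_fields(vec![geom.into()])
        .build()
        .unwrap()
}
```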
/// ```
#[derive(Clone, Debug, PartialEq, Default)]
pub struct GeospatialStatistics {
    /// Optional bounding defining the spatial extent, where None represents a lack of information.
I wonder why these comments were removed?
I moved them to the accessor methods in a previous change...I'm not sure why they're showing up in this diff. My theory was that they'd be more likely to be read there but I don't mind copying them back.
fn new_accumulator(&self, descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator>;
}

/// Dynamic [`GeospatialStatistics`] accumulator
this is a nice API for optional statistics encoding
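To illustrate the shape of this pluggable design: only the quoted `new_accumulator` signature comes from the PR; the accumulator methods below are assumptions for the sketch.

```rust
// Stub standing in for the real parquet GeospatialStatistics.
struct GeospatialStatistics;

// Toy accumulator trait: `update`/`finish` are illustrative assumptions.
trait GeoStatsAccumulator {
    fn update(&mut self, wkb: &[u8]);
    fn finish(&mut self) -> Option<GeospatialStatistics>;
}

// A factory mirroring the quoted `new_accumulator` method: the writer asks
// for a fresh accumulator per column chunk, so the geometry-bounding logic
// can live in an optional crate behind a feature flag.
trait GeoStatsAccumulatorFactory {
    fn new_accumulator(&self) -> Box<dyn GeoStatsAccumulator>;
}
```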
# Enable parquet variant support
variant_experimental = ["arrow", "parquet-variant", "parquet-variant-json", "parquet-variant-compute"]
# Enable geospatial support
geospatial = ["parquet-geospatial"]
Could you please also add the new feature flag to the main crate readme as well?
https://github.com/apache/arrow-rs/blob/main/parquet/README.md#feature-flags
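For reference, enabling the new flag from a downstream crate would look something like this (the version number is illustrative):

```toml
[dependencies]
parquet = { version = "56", features = ["geospatial"] }
```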
🤖: Benchmark completed
Looks good to me, thanks @paleolimbot.
One question I have deals with the column chunk `Statistics` and the column index. Am I correct that if geo stats are written, the column chunk stats should be None? And should the column index for such a column also be None? If so, could you add a test that verifies this? 🙏 Could be in a later PR.
The min/max value should be absent but the null count should still be there. I added a test!
I actually have no idea what a column index is, which suggests to me that it should be None 🙂
It's a version of the page statistics available without having to parse the individual page headers. It has the unfortunate(*) property that min and max are mandatory, so if either min or max is missing the column index cannot be written.
(*) Unfortunate because there is other information in the column index beyond min and max statistics that can still be of use for page pruning. Null pages and level histograms among them.
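A minimal sketch of the kind of assertion being discussed, assuming the `Statistics` accessors of recent parquet releases (`null_count_opt`, `min_bytes_opt`, `max_bytes_opt`) and a hypothetical pre-written test file:

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

// Sketch: for a Geometry column, expect the null count to survive in the
// chunk Statistics while min/max are absent. The file path is hypothetical.
fn check_geo_chunk_stats() {
    let file = File::open("geometry.parquet").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let column = reader.metadata().row_group(0).column(0);

    let stats = column.statistics().expect("chunk statistics present");
    assert!(stats.null_count_opt().is_some());
    assert!(stats.min_bytes_opt().is_none());
    assert!(stats.max_bytes_opt().is_none());
}
```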
I think this is ready to merge. @alamb have your concerns been addressed?
pub fn wkb_point_xy(x: f64, y: f64) -> Vec<u8> {
    let mut item: [u8; 21] = [0; 21];
    item[0] = 0x01; // byte-order marker: little-endian
    item[1] = 0x01; // geometry type 1 (Point); bytes 2..5 stay zero
    item[5..13].copy_from_slice(x.to_le_bytes().as_slice());
    item[13..21].copy_from_slice(y.to_le_bytes().as_slice());
    item.to_vec()
}
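For what the helper produces, a quick check of the WKB buffer layout (pure std, runnable with the helper above in scope):

```rust
// POINT (1 2) as little-endian WKB: order byte, u32 type tag, two f64s.
let buf = wkb_point_xy(1.0, 2.0);
assert_eq!(buf.len(), 21);
assert_eq!(buf[0], 0x01); // little-endian byte order
assert_eq!(u32::from_le_bytes(buf[1..5].try_into().unwrap()), 1); // Point
assert_eq!(f64::from_le_bytes(buf[5..13].try_into().unwrap()), 1.0);
assert_eq!(f64::from_le_bytes(buf[13..21].try_into().unwrap()), 2.0);
```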
It's not a huge deal for XY and XYZM points, but if we want more complex helpers for more complex geometries, I think it would be more maintainable and more understandable for future people to use an existing crate to generate the WKB buffers. (In my own projects I use `wkt::types` as simple extended-dimension geometry types that I pass into `wkb` APIs.)
Definitely! The `GeometryBounder` is tested with those here (where we have wkt as a dev dependency). I don't mind how these are implemented (I just needed something for the parquet/geospatial tests).
The only thing I want to make sure of is that this doesn't impact writing performance. The benchmark results above seem to suggest there might be an impact.

However, I tried to reproduce this locally and it looks fine to me:

cargo bench --bench arrow_writer -- "list_primitive/parquet_2"

list_primitive/parquet_2
    time:   [994.65 µs 1.0011 ms 1.0083 ms]
    thrpt:  [2.0649 GiB/s 2.0797 GiB/s 2.0931 GiB/s]
  change:
    time:   [+0.0649% +2.0194% +3.9738%] (p = 0.04 < 0.05)
    thrpt:  [-3.8219% -1.9795% -0.0649%]
    Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Three approvals so let's get this one in and we can iterate if necessary in follow-on PRs!
Thanks again @paleolimbot @etseidl and @kylebarron!
🤖: Benchmark completed
Which issue does this PR close?
Rationale for this change
One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.
What changes are included in this PR?
This PR makes the minimal change required to enable this support.
Are these changes tested?
Yes!
Are there any user-facing changes?
There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.